27 research outputs found
Memory vectors for similarity search in high-dimensional spaces
We study an indexing architecture to store and search in a database of
high-dimensional vectors from the perspective of statistical signal processing
and decision theory. This architecture is composed of several memory units,
each of which summarizes a fraction of the database by a single representative
vector. The potential similarity of the query to one of the vectors stored in
the memory unit is gauged by a simple correlation with the memory unit's
representative vector. This representative optimizes the test of the following
hypothesis: the query is independent from any vector in the memory unit vs. the
query is a simple perturbation of one of the stored vectors.
Compared to exhaustive search, our approach finds the most similar database
vectors significantly faster without a noticeable reduction in search quality.
Interestingly, the reduction of complexity is provably better in
high-dimensional spaces. We empirically demonstrate its practical interest in a
large-scale image search scenario with off-the-shelf state-of-the-art
descriptors.Comment: Accepted to IEEE Transactions on Big Dat
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Retrieval augmented models are becoming increasingly popular for computer
vision tasks after their recent success in NLP problems. The goal is to enhance
the recognition capabilities of the model by retrieving similar examples for
the visual input from an external memory set. In this work, we introduce an
attention-based memory module, which learns the importance of each retrieved
example from the memory. Compared to existing approaches, our method removes
the influence of the irrelevant retrieved examples, and retains those that are
beneficial to the input query. We also thoroughly study various ways of
constructing the memory dataset. Our experiments show the benefit of using a
massive-scale memory dataset of 1B image-text pairs, and demonstrate the
performance of different memory representations. We evaluate our method in
three different classification tasks, namely long-tailed recognition, learning
with noisy labels, and fine-grained classification, and show that it achieves
state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.Comment: Accepted to CVPR 202
A Memory Transformer Network for Incremental Learning
We study class-incremental learning, a training setup in which new classes of
data are observed over time for the model to learn from. Despite the
straightforward problem formulation, the naive application of classification
models to class-incremental learning results in the "catastrophic forgetting"
of previously seen classes. One of the most successful existing methods has
been the use of a memory of exemplars, which overcomes the issue of
catastrophic forgetting by saving a subset of past data into a memory bank and
utilizing it to prevent forgetting when training future tasks. In our paper, we
propose to enhance the utilization of this memory bank: we not only use it as a
source of additional training data like existing works but also integrate it in
the prediction process explicitly.Our method, the Memory Transformer Network
(MTN), learns how to combine and aggregate the information from the nearest
neighbors in the memory with a transformer to make more accurate predictions.
We conduct extensive experiments and ablations to evaluate our approach. We
show that MTN achieves state-of-the-art performance on the challenging
ImageNet-1k and Google-Landmarks-1k incremental learning benchmarks
Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations
CVPR 2017Query expansion is a popular method to improve the quality of image retrieval with both conventional and CNN representations. It has been so far limited to global image similarity. This work focuses on diffusion, a mechanism that captures the image manifold in the feature space. The diffusion is carried out on descriptors of overlapping image regions rather than on a global image descriptor like in previous approaches. An efficient off-line stage allows optional reduction in the number of stored regions. In the on-line stage, the proposed handling of unseen queries in the indexing stage removes additional computation to adjust the precomputed data. We perform diffusion through a sparse linear system solver, yielding practical query times well below one second. Experimentally, we observe a significant boost in performance of image retrieval with compact CNN descriptors on standard benchmarks, especially when the query object covers only a small part of the image. Small objects have been a common failure case of CNN-based retrieval
Marchés agricoles, Prix. 3/1994. = "Agricultural Markets, Prices. 3/1994.
Despite the success of deep learning on representing images for particular object retrieval, recent studies show that the learned representations still lie on manifolds in a high dimensional space. This makes the Euclidean nearest neighbor search biased for this task. Exploring the mani-folds online remains expensive even if a nearest neighbor graph has been computed offline. This work introduces an explicit embedding reducing manifold search to Euclidean search followed by dot product similarity search. This is equivalent to linear graph filtering of a sparse signal in the frequency domain. To speed up online search, we compute an approximate Fourier basis of the graph offline. We improve the state of art on particular object retrieval datasets including the challenging Instre dataset containing small objects. At a scale of 10 5 images, the offline cost is only a few hours, while query time is comparable to standard similarity search
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
In this paper, we propose an autonomous information seeking visual question
answering framework, AVIS. Our method leverages a Large Language Model (LLM) to
dynamically strategize the utilization of external tools and to investigate
their outputs, thereby acquiring the indispensable knowledge needed to provide
answers to the posed questions. Responding to visual questions that necessitate
external knowledge, such as "What event is commemorated by the building
depicted in this image?", is a complex task. This task presents a combinatorial
search space that demands a sequence of actions, including invoking APIs,
analyzing their responses, and making informed decisions. We conduct a user
study to collect a variety of instances of human decision-making when faced
with this task. This data is then used to design a system comprised of three
components: an LLM-powered planner that dynamically determines which tool to
use next, an LLM-powered reasoner that analyzes and extracts key information
from the tool outputs, and a working memory component that retains the acquired
information throughout the process. The collected user behavior serves as a
guide for our system in two key ways. First, we create a transition graph by
analyzing the sequence of decisions made by users. This graph delineates
distinct states and confines the set of actions available at each state.
Second, we use examples of user decision-making to provide our LLM-powered
planner and reasoner with relevant contextual instances, enhancing their
capacity to make informed decisions. We show that AVIS achieves
state-of-the-art results on knowledge-intensive visual question answering
benchmarks such as Infoseek and OK-VQA.Comment: Published on NeurIPS 202
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language
Model (REVEAL) that learns to encode world knowledge into a large-scale memory,
and to retrieve from it to answer knowledge-intensive queries. REVEAL consists
of four key components: the memory, the encoder, the retriever and the
generator. The large-scale memory encodes various sources of multimodal world
knowledge (e.g. image-text pairs, question answering pairs, knowledge graph
triplets, etc) via a unified encoder. The retriever finds the most relevant
knowledge entries in the memory, and the generator fuses the retrieved
knowledge with the input query to produce the output. A key novelty in our
approach is that the memory, encoder, retriever and generator are all
pre-trained end-to-end on a massive amount of data. Furthermore, our approach
can use a diverse set of multimodal knowledge sources, which is shown to result
in significant gains. We show that REVEAL achieves state-of-the-art results on
visual question answering and image captioning
Mémoires continues représentant des ensembles de vecteurs et des collections d’images
In this thesis, we study the indexing and query expansion problems in image retrieval. The former sacrifices the accuracy for efficiency, whereas the latter takes the opposite perspective and improves accuracy with additional cost. Our proposed solutions to both problems consist of utilizing continuous representations of a set of vectors. We turn our attention to indexing first, and follow the group testing scheme. We assign each dataset vector to a group, and represent each group with a single vector representation. We propose memory vectors, whose solution is optimized under the membership test hypothesis. The optimal solution for this problem is based on Moore-Penrose pseudo-inverse, and shows superior performance compared to basic sum pooling. We also provide a data-driven approach optimizing the assignment and representation jointly. The second half of the transcript focuses on the query expansion problem, representing a set of vectors with weighted graphs. This allows us to retrieve objects that lie on the same manifold, but further away in Euclidean space. We improve the efficiency of our technique even further, creating high-dimensional diffusion embeddings offline, so that they can be compared with a simple dot product in the query time. For both problems, we provide thorough experiments and analysis in well-known image retrieval benchmarks and show the improvements achieved by proposed methods.Cette thèse étudie l'indexation et le mécanisme d'expansion de requête en recherche d'image. L'indexation sacrifie la qualité de la recherche pour une plus grande efficacité; l'expansion de requête prend ce compromis dans l'autre sens : il améliore la qualité de la recherche avec un coût en complexité additionnel. Nous proposons des solutions pour les deux approches qui utilisent une représentation continue d'un ensemble de vecteurs. Pour l'indexation, notre solution est basée sur le test par groupe. Chaque vecteur image est assigné à un groupe, et chaque groupe est représenté par un seul vecteur. C'est la représentation continue de l'ensemble des vecteur du groupe. L'optimisation de cette représentation pour produire un bon test d'appartenance donne une solution basée sur la pseudo-inverse de Moore-Penrose. Elle montre des performances supérieures à celles d'une somme basique des vecteurs du groupe. Nous proposons aussi une alternative suivant au plus près les vecteurs-images de la base. Elle optimise conjointement l'assignation des vecteurs images à des groupes ainsi que la représentation vectorielle de ces groupes. La deuxième partie de la thèse étudie le mécanisme d'expansion de requête au moyen d'un graphe pondéré représentant les vecteurs images. Cela permet de retrouver des images similaires le long d'une même variété géométrique, mais éloignées en distance Euclidienne. Nous donnons une implémentation ultra-rapide de ce mécanisme en créant des représentations vectorielles incorporant la diffusion. Ainsi, le mécanisme d'expansion se réduit à un simple produit scalaire entre les représentations vectorielles lors de la requête. Les deux parties de la thèse fournissent une analyse théorique et un travail expérimental approfondi utilisant les protocoles et les jeux de données standards en recherche d'images. Les méthodes proposées ont des performances supérieures à l'état de l'art
Mémoires continues représentant des ensembles de vecteurs et des collections d’images
In this thesis, we study the indexing and query expansion problems in image retrieval. The former sacrifices the accuracy for efficiency, whereas the latter takes the opposite perspective and improves accuracy with additional cost. Our proposed solutions to both problems consist of utilizing continuous representations of a set of vectors. We turn our attention to indexing first, and follow the group testing scheme. We assign each dataset vector to a group, and represent each group with a single vector representation. We propose memory vectors, whose solution is optimized under the membership test hypothesis. The optimal solution for this problem is based on Moore-Penrose pseudo-inverse, and shows superior performance compared to basic sum pooling. We also provide a data-driven approach optimizing the assignment and representation jointly. The second half of the transcript focuses on the query expansion problem, representing a set of vectors with weighted graphs. This allows us to retrieve objects that lie on the same manifold, but further away in Euclidean space. We improve the efficiency of our technique even further, creating high-dimensional diffusion embeddings offline, so that they can be compared with a simple dot product in the query time. For both problems, we provide thorough experiments and analysis in well-known image retrieval benchmarks and show the improvements achieved by proposed methods.Cette thèse étudie l'indexation et le mécanisme d'expansion de requête en recherche d'image. L'indexation sacrifie la qualité de la recherche pour une plus grande efficacité; l'expansion de requête prend ce compromis dans l'autre sens : il améliore la qualité de la recherche avec un coût en complexité additionnel. Nous proposons des solutions pour les deux approches qui utilisent une représentation continue d'un ensemble de vecteurs. Pour l'indexation, notre solution est basée sur le test par groupe. Chaque vecteur image est assigné à un groupe, et chaque groupe est représenté par un seul vecteur. C'est la représentation continue de l'ensemble des vecteur du groupe. L'optimisation de cette représentation pour produire un bon test d'appartenance donne une solution basée sur la pseudo-inverse de Moore-Penrose. Elle montre des performances supérieures à celles d'une somme basique des vecteurs du groupe. Nous proposons aussi une alternative suivant au plus près les vecteurs-images de la base. Elle optimise conjointement l'assignation des vecteurs images à des groupes ainsi que la représentation vectorielle de ces groupes. La deuxième partie de la thèse étudie le mécanisme d'expansion de requête au moyen d'un graphe pondéré représentant les vecteurs images. Cela permet de retrouver des images similaires le long d'une même variété géométrique, mais éloignées en distance Euclidienne. Nous donnons une implémentation ultra-rapide de ce mécanisme en créant des représentations vectorielles incorporant la diffusion. Ainsi, le mécanisme d'expansion se réduit à un simple produit scalaire entre les représentations vectorielles lors de la requête. Les deux parties de la thèse fournissent une analyse théorique et un travail expérimental approfondi utilisant les protocoles et les jeux de données standards en recherche d'images. Les méthodes proposées ont des performances supérieures à l'état de l'art